Red Wine Quality EDA by Adrien Viani

Citation Request

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Description of Paper and Data

This EDA paper focuses on the effect of various red wine characteristics on perceived quality. The dataset and related information are available via the citation above. The 12 variables explored in this paper are listed below with a brief description.

Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Units of Attributes:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Univariate Plots Section

In this section, I will plot a boxplot and histogram for each variable laid out above, discuss the qualities of the distribution, and possibly log transform the variables to better approximate a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Quality appears roughly normally distributed. Of note is that in our dataset it appears that the bulk of the results fall within the range 5-6. The relative sparseness of low quality and high quality wines in the sample could present a balancing issue in any predictive analysis pursued down the line.
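To put a number on that imbalance, here is a minimal sketch in base R. It rebuilds the quality vector from the counts printed in the table above rather than reading the original file, so the variable names (`quality_counts`, `quality_share`, `middle_share`) are only illustrative.

```r
# Rebuild the quality vector from the counts printed in the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Share of each score, as a percentage of all observations.
quality_share <- round(100 * prop.table(table(quality)), 1)
print(quality_share)

# Share of wines rated 5 or 6 (the "bulk" described above).
middle_share <- mean(quality %in% c(5, 6))
print(round(middle_share, 3))
```

Roughly four out of five wines sit in the 5-6 band, which is why any classifier trained on the raw sample would see very few examples of scores 3, 4, and 8.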

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity is approximately normal with a slight right skew and potential outliers at the end of the tail. The bulk of the data falls between 7.1 and 9.2 (tartaric acid - g / dm^3).
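As a rough check on the "potential outliers" remark, the sketch below applies Tukey's 1.5 x IQR rule to the quartiles reported above. It assumes the boxplot was drawn with the default whisker coefficient of 1.5 (the default in base R's `boxplot.stats`); the variable names are illustrative.

```r
# Quartiles and extremes of fixed acidity, copied from the summary output above.
q1 <- 7.1
q3 <- 9.2
observed_min <- 4.6
observed_max <- 15.9

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are drawn as outliers.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("fences:", lower_fence, upper_fence, "\n")
cat("low outliers possible: ", observed_min < lower_fence, "\n")
cat("high outliers present: ", observed_max > upper_fence, "\n")
```

Under that assumption the maximum lies above the upper fence while the minimum stays inside the lower one, which is consistent with outliers appearing only in the right tail.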

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity also appears approximately normal with a slight right skew and potential outliers at the end of the tail. The bulk of the results fall between .39 and .64 (acetic acid - g / dm^3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid does not appear to be normally distributed. There are peaks in frequency at 0 and 0.5, with other, less stark peaks at approximately .14 and .2 (g / dm^3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar is right skewed with a very long tail. The bulk of residual sugar data lies between 1.9 and 2.6 (g / dm^3). After a log transform, shown below, it appears to be closer to normally distributed, albeit with some right skew remaining.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.04576  0.27875  0.34242  0.36925  0.41497  1.19033
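The transformed summary just above is consistent with a base-10 logarithm of the raw values. The sketch below checks this using only the numbers printed in the two residual sugar summaries; the vector names are illustrative, and the original code may have applied the transform inside the plotting call instead.

```r
# Five-number summary of residual sugar (g / dm^3), copied from the raw output.
raw_summary <- c(Min = 0.9, Q1 = 1.9, Median = 2.2, Q3 = 2.6, Max = 15.5)

# Applying log10 to each value reproduces the transformed summary.
log_summary <- round(log10(raw_summary), 5)
print(log_summary)

# The mean does not carry over: log10 of the mean is not the mean of the logs.
log_of_mean <- log10(2.539)
print(round(log_of_mean, 4))
```

The minimum, quartiles, median, and maximum map straight through the transform because it preserves order, while the mean of the logged values (0.36925 in the output) is lower than the log of the raw mean.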

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides is right skewed with a very long right tail. The bulk of chlorides data lies between .07 and .09 (g / dm^3). After a log transform, shown below, it appears to be closer to normally distributed, albeit with long, thin tails.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.921  -1.155  -1.102  -1.088  -1.046  -0.214

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide is right skewed. The bulk of the data lies between 7 and 21 (mg / dm^3). After a log transform it does not appear normally distributed, and actually looks bimodal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8451  1.1461  1.1058  1.3222  1.8573

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulfur dioxide is right skewed with a very long right tail. The bulk of total sulfur dioxide data lies between 22 and 62 (mg / dm^3). After a log transform, shown below, it appears to be closer to normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7782  1.3424  1.5798  1.5638  1.7924  2.4609

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density appears normally distributed, with a mean of .9967 and the bulk of results falling between .9956 and .9978 (g / cm^3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH appears normally distributed, with a mean of 3.311 and the bulk of results falling between 3.21 and 3.4.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates is right skewed. The bulk of sulphate data lies between .55 and .73 (potassium sulphate - g / dm^3). After a log transform, shown below, it appears to be closer to normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4815 -0.2596 -0.2076 -0.1934 -0.1367  0.3010

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Finally, alcohol does not appear to be normally distributed. Alcohol is slightly right skewed, but does not look like it would benefit from a log transform. The bulk of the results fall between 9.5 and 11.1 (% by volume).

Univariate Analysis

What is the structure of your dataset?

The dataset covers 1,599 observations across 12 variables. All variables are real, continuous numbers except for quality, which is an integer factor with possible levels 0 (low) through 10 (high). Of note is the fact that several values of quality are not present in the dataset (0, 1, 2, 9, and 10). Some of the other variables are related by definition (see the units above) and will likely have high covariance with each other.

What is/are the main feature(s) of interest in your dataset?

The primary feature for my analysis will be quality, but I will also explore the covariance between other properties that are related by unit definition or flagged by a correlation plot.

What other features in the dataset do you think will help support your analysis?

I believe alcohol, citric acid, sulphates, and sugar will support my analysis, as these are the qualities I immediately recognize from my knowledge of wine. I am sure, though, that in exploring the data I may discover new connections I wasn’t aware of before.

Did you create any new variables from existing variables in the dataset?

I did not create any new (composite) variables from the dataset, but did log transform some of the data in my initial exploration.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data?  

If so, why did you do this?

When exploring the data I log transformed several skewed distributions: residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates. The log transform ideally reduces the size and impact of long tails, which can otherwise affect later regression analysis.
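To illustrate how a log transform pulls in a long right tail, here is a small sketch on simulated data. The sample is drawn from a log-normal distribution and is not the wine data; the `skewness` helper is a hypothetical, moment-based definition written here because base R does not ship one.

```r
# Moment-based sample skewness (base R has no built-in skewness function).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed data; NOT the wine data, just an illustration.
set.seed(42)
simulated <- rlnorm(1000, meanlog = 0, sdlog = 0.8)

skew_raw <- skewness(simulated)
skew_log <- skewness(log10(simulated))
cat("skewness before:", round(skew_raw, 2), " after:", round(skew_log, 2), "\n")
```

The same helper could be applied to each wine variable before and after the transform to decide, with a number rather than by eye, which ones benefit from it.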

Bivariate Plots Section

We can see from this correlation plot that there are many strong correlations that can be explored. We will graph and discuss some of these pairs, as listed below.
Density: Residual Sugar, Fixed Acidity, Citric Acid, Alcohol
Fixed Acidity: Citric Acid, Volatile Acidity
Free Sulfur Dioxide: Total Sulfur Dioxide
Quality: Alcohol, Sulphates, Volatile Acidity, Citric Acid, Residual Sugar

There is a mild positive correlation between density and residual sugar.

There is a moderate positive correlation between density and fixed acidity.

Based on the correlation chart above, there is a mild positive correlation between citric acid and density. From the chart, we can see, though, that this relationship is quite noisy.

Finally, there is a moderate negative correlation between density and alcohol. This makes sense from a chemical perspective, as alcohol is less dense than water.
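The chemical argument can be sketched with a simple ideal-mixing calculation. This is a deliberate simplification: it treats wine as only ethanol and water, ignores the volume contraction that occurs when they mix, and ignores dissolved sugar and acids, so the absolute values come out lower than the measured densities. The density constants are approximate textbook values and the function name is illustrative.

```r
# Approximate densities at about 20 degrees C, in g / cm^3 (textbook values).
ethanol_density <- 0.789
water_density <- 0.998

# Ideal-mixing density for a given alcohol fraction by volume.
# Assumption: no volume contraction and no dissolved sugar or acids.
mix_density <- function(abv_percent) {
  fraction <- abv_percent / 100
  fraction * ethanol_density + (1 - fraction) * water_density
}

# Evaluate at the min, median, and max alcohol levels reported for this dataset.
abv <- c(8.4, 10.2, 14.9)
print(round(mix_density(abv), 4))
```

Only the direction matters here: every extra percent of alcohol replaces water with a lighter liquid, so density falls as alcohol rises.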

There is a moderate-strong correlation between fixed acidity and volatile acidity, although the data is quite noisy.

There is a slight negative correlation between volatile acidity and citric acid, but the data is quite noisy.

There is a moderate positive correlation between free and total sulfur dioxide, which makes some degree of intuitive sense given that the measures are related.

Quality is clearly increasing with alcohol %.

Quality is mildly increasing with sulphates, although it is clear that there are outliers in sulphates at each quality level, particularly quality 5-6. This could be indicative that extremely high values of sulphates (above a particular threshold) could potentially be a negative influence on wine quality.

Quality increases as volatile acidity decreases. This interaction makes sense as volatile acidity is what imparts a vinegar taste to wine.

Quality is increasing with citric acid. This interaction makes sense as citric acid is perceived to add “freshness” and a pleasant taste to wines.

Quality is flat with residual sugar, although it is clear that there are outliers at each quality level. It appears that the highest (and generally densest) sugar outliers are in the middle of the quality range, which may represent an insight into flavor preferences for high quality red wines (i.e. a preference for “dry” wines).

There appears to be a mildly parabolic relationship between quality and both free and total sulfur dioxide, with larger tails for each of the sulfur measures in the middle quality range (5-7).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality is positively correlated with alcohol, citric acid, and sulphates while it is negatively correlated with volatile acidity. Interestingly enough, residual sugar is not strongly correlated with quality, but it appears that the middle of the quality range has the largest high sugar outliers (ie. larger tails). This could indicate that overly sweet wine may not rate as well as similar wines that have less sugar. Quality is mildly parabolic with both sulfur measures, with similar tail behavior as sugar in the 5-7 range.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There was nothing terribly surprising in the elements that I investigated, but it was reassuring to confirm certain relationships that made chemical sense based on variable descriptions ie. density vs alcohol and total vs free sulfur dioxide. It might be interesting to investigate whether more sugar (holding other positive traits equal like alcohol, citric acid, and possibly sulphates) has a potentially negative relationship with quality.

What was the strongest relationship you found?

The relationship between alcohol and density is strong, primarily due to the chemical properties of alcohol. As for quality, citric acid and alcohol look to be the variables most strongly related to it.

Multivariate Plots Section

In plotting alcohol percentage and citric acid together with quality as a color gradient, it is clear that more of the high quality wines fall in high alcohol and high citric acid regions.

In plotting alcohol percentage and sulphates together with quality as a color gradient, it is clear that more of the high quality wines fall in high alcohol regions, while sulphates appears to have less of a linear relation with quality and more of a parabolic one. (Quality seems to drop off, outside of a specific band of sulphate quantity)

Plotting volatile acidity against free sulfur dioxide, with quality as a color gradient, shows quality declining as volatile acidity increases, and less so as free sulfur dioxide increases. It does appear, though, that wines can still rank 7+ even with high(er) free sulfur dioxide. From the bivariate plot above, we see that free and total sulfur dioxide are positively related, so it is likely that the better wines with high free sulfur dioxide fall in the lower range of total sulfur dioxide for those particular free sulfur levels.

Here we have a 3d graph that examines the impact sugar has on quality for different levels of citric acid and alcohol content. It is clear that for similar levels of citric acid and alcohol content, increasing sugar content appears correlated with lower wine quality. Because the bulk of the sample falls in the “middle” range of quality 5-7, it would be good to have a broader sample to see if this holds across “slices” of citric acid and alcohol.

In this graph, using turntable rotation to isolate “faces” of the graph, we can compare sulphate’s interactions with alcohol, citric acid, and quality. It looks like there is a sulphates band, within which quality ratings are moderately higher. When the sulphates are higher or lower than that band, quality tends to decline all else equal.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It is clear from the first plot in the multivariate section that the variables that correlate positively with quality (citric acid, alcohol) strengthened each other. The negatively correlated variables examined (free sulfur dioxide and volatile acidity) did not.

It also appears that higher residual sugar may contribute to declining quality compared to wines with similar citric acid and alcohol levels but lower sugars. Additionally, while sulphates appeared slightly positively correlated in the 2d boxplot of sulphates vs quality, the 3d plot appears to indicate that it is more likely that there is an “optimal” range of sulphates in wine. Above or below this range corresponds with a drop off in quality.

Were there any interesting or surprising interactions between features?

I think the residual sugar and sulphate properties discussed above are interesting. It points to the necessity of careful interpretation and exploration before drawing conclusions to best understand covariates. This is especially important before performing any modeling as it could lead to incorrect or incomplete understanding.


Final Plots and Summary

Plot One

Description One

This set of four plots shows some of the relationships between quality and variables in the data set. We note positive correlations with alcohol, citric acid and sulphates; we also note a negative correlation with volatile acidity. These plots are interesting as they are clearly visible interactions amongst some of the 11 variables vs quality.

Plot Two

Description Two

This graph examines the interaction of sugar, alcohol and citric acid with quality. In examining the box plots of sugar vs quality, I noticed that high sugar outliers were more common in the 5-7 quality range, and could affect quality negatively. To explore this, I needed a plot where I could compare and contrast wines with similar citric acid and alcohol profiles but varying levels of sugar. While the data is very clustered, and sampling at both the top and bottom of the quality range is low, it does appear that higher sugar correlates with lower ratings within similar citric acid and alcohol profiles. This is interesting because it indicates it might be worth exploring various thresholds to define new factor variables where sugar is high, to improve and explore predictive modeling down the road.

Plot Three

Description Three

This graph examines sulphates, citric acid, and alcohol and their combined interactions on quality. By turning this graph using the turntable function in the interactive menu, one can look at different angles and highlight various points to better understand the interaction of sulphates with quality and the other variables present. By moving the graph into a “2d” visual, one can see that there appears to be an optimal range for sulphates to achieve good quality. Outside of this band, wine quality appears to drop off across all alcohol and citric acid ranges. This shows that while the initial boxplot may have indicated a positive correlation between sulphates and quality, the actual relationship may be slightly more complex.

Reflection

I enjoyed exploring the red wine dataset, and found learning R to be relatively straightforward. I found the data mostly easy to work with, and was able to learn about functions in R to expedite some of my work. I also learned more about the plotly package, and was able to exploit the 3D scatterplot in the package to better explore how sulphates and residual sugar affect quality in the sample. Based on feedback in my project review, I was able to improve the graphing of factor levels in quality, and greatly improved the presentation of my multivariate graphs by setting discrete color levels with sensible color palettes.

One of the challenges of the dataset was interpreting data that didn’t have clear positive and negative correlations with quality, like sugar and sulphates. The experience emphasized the importance of looking beyond common basic measures like quartiles and means. I could see 3d graphing and examining tail behavior as an important part of my future EDA and modeling efforts. I also found this portion of the analysis challenging, as there is limited data at the high and low end of the quality ranges, making it hard to visualize certain interactions.

Idea for Future Work

In the future, I would like to explore predicting wine quality using basic classification models. I could split quality levels into various buckets (i.e. good, average, bad) to address the balancing issues. From there I could balance the data by using standard sampling methods. Additionally, I could transform some of the variables to better quantify identified behavior. For example, creating a centered and squared version of sulphates could better identify the “optimal quality range” behavior I identified. Similarly, a transformation that separates sugar outliers at similar levels of citric acid and alcohol may be useful. After that work, I could apply a random forest based method to determine the importance of the variables in the dataset and further my analysis.
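A minimal sketch of two of these ideas in base R follows. The bucket boundaries (3-4 bad, 5-6 average, 7-8 good) are a hypothetical choice, not something fixed by the analysis above, and the quality vector is rebuilt from the counts reported in the univariate section rather than read from the original file. The `center_square` helper is likewise illustrative.

```r
# Rebuild quality from the counts reported in the univariate section.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Hypothetical cut points: 3-4 = bad, 5-6 = average, 7-8 = good.
quality_bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("bad", "average", "good"))
print(table(quality_bucket))

# Centered and squared sulphates: small near the centre, large at both extremes.
# The centre used here is the sample mean (0.6581) from the summary output.
center_square <- function(x, center) (x - center)^2
sulphates_examples <- c(0.33, 0.62, 2.00)  # min, median, max from the summary
squared_distance <- center_square(sulphates_examples, center = 0.6581)
print(round(squared_distance, 4))
```

Bucketing alone leaves the classes uneven (the "average" bucket still dominates), which is why a resampling step would be needed afterwards. The squared term grows on either side of the centre, so a model could use it to express a drop in quality both below and above a preferred sulphates band; the mean is only one possible choice of centre.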